The Evolution of Large Language Model Architectures: From BERT to GPT and T5
AI012 Lesson 2

The Three Variants of the Transformer Architecture

The evolution of large language models marks a paradigm shift: from task-specific models to "unified pre-training," in which a single architecture can be adapted to a wide range of natural language processing needs.

At the heart of this shift is the self-attention mechanism, which allows the model to weigh the importance of different tokens in a sequence:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
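The formula above can be sketched directly in NumPy. This is a minimal, single-head illustration; the shapes and random inputs are made up for demonstration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V                   # weighted average of the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```

The `1/sqrt(d_k)` scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot saturation.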

1. Encoder-Only (BERT)

  • Mechanism: Masked Language Modeling (MLM)
  • Behavior: bidirectional context; the model "sees" the entire sentence at once in order to predict the masked tokens.
  • Best suited for: natural language understanding (NLU), sentiment analysis, and named entity recognition (NER).

2. Decoder-Only (GPT)

  • Mechanism: autoregressive modeling
  • Behavior: left-to-right processing; the model predicts the next token strictly from the preceding context (causal masking).
  • Best suited for: natural language generation (NLG) and creative writing. This is the foundation of modern LLMs such as GPT-4 and Llama 3.
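Causal masking can be seen numerically: blocked (future) positions are set to -inf before the softmax, so they receive exactly zero weight. A small NumPy sketch, with an illustrative sequence length and uniform raw scores:

```python
import numpy as np

seq_len = 5
# True above the diagonal = future positions a token must NOT attend to
blocked = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = np.zeros((seq_len, seq_len))   # pretend all raw scores are equal
scores[blocked] = -np.inf               # causal mask: hide future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row i spreads its attention uniformly over positions 0..i; future positions get 0.
print(np.round(weights, 2))
```

Each row still sums to 1, so the model's prediction for position i depends only on tokens 0..i, which is exactly what next-token training requires.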

3. Encoder-Decoder (T5)

  • Mechanism: Text-to-Text Transfer Transformer.
  • Behavior: the encoder converts the input string into a dense representation, and the decoder generates the target string.
  • Best suited for: translation, summarization, and sequence-to-sequence mapping tasks.
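The three variants differ mainly in which positions are allowed to attend to which. A hypothetical sketch of the visibility masks (1 = visible; the helper names are made up for illustration):

```python
import numpy as np

def encoder_mask(n):
    """Encoder-only (BERT): every token sees every other token (bidirectional)."""
    return np.ones((n, n), dtype=int)

def decoder_mask(n):
    """Decoder-only (GPT): token i sees only positions 0..i (causal)."""
    return np.tril(np.ones((n, n), dtype=int))

def cross_attention_mask(n_tgt, n_src):
    """Encoder-decoder (T5): every target position sees the full encoded source."""
    return np.ones((n_tgt, n_src), dtype=int)

print("BERT:\n", encoder_mask(3))
print("GPT:\n", decoder_mask(3))
print("T5 cross-attention:\n", cross_attention_mask(2, 3))
```

T5 combines all three patterns: bidirectional self-attention over the source, causal self-attention over the target, and full cross-attention between them.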

Key Insight: Decoder Dominance
The industry has largely converged on decoder-only architectures because of their superior scaling laws and their emergent reasoning abilities in zero-shot settings.
VRAM and Context Window Implications
In decoder-only models, the KV cache grows linearly with sequence length. A 100K context window requires dramatically more VRAM than an 8K window, which makes local deployment of long-context models very challenging without quantization.
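A back-of-envelope estimate makes this concrete. The sketch below assumes a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, fp16, no grouped-query attention) — these numbers are assumptions for illustration, not from the lesson:

```python
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per=2):
    """KV cache size: 2 tensors (K and V) per layer, per head, per position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per
    return total_bytes / 1024**3

print(f"  8K context: {kv_cache_gib(8_192):5.1f} GiB")   # 4.0 GiB
print(f"100K context: {kv_cache_gib(100_000):5.1f} GiB")  # ~48.8 GiB
```

With grouped-query attention or int8/int4 KV-cache quantization these figures shrink proportionally, which is exactly why quantization matters for local long-context deployment.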
Question 1
Why did the industry move from BERT-style encoders to GPT-style decoders for Large Language Models?
Decoders scale more effectively for generative tasks and follow-up instructions via next-token prediction.
Encoders cannot process text bidirectionally.
Decoders require less training data for classification tasks.
Encoders are incompatible with the Self-Attention mechanism.
Question 2
Which architecture treats every NLP task as a "text-to-text" problem?
Encoder-Only (BERT)
Decoder-Only (GPT)
Encoder-Decoder (T5)
Recurrent Neural Networks (RNN)
Challenge: Architectural Bottlenecks
Analyze deployment constraints based on architecture.
If you are building a model for real-time document summarization where the input is very long, explain why a Decoder-only model might be preferred over an Encoder-Decoder model in modern deployments.
Step 1
Identify the architectural bottleneck regarding context processing.
Solution:
Encoder-Decoders must process the entire long input through the encoder, then perform cross-attention in the decoder, which can be computationally heavy and complex to optimize for extremely long sequences. Decoder-only models process everything uniformly. With modern techniques like FlashAttention and KV Cache optimization, scaling the context window in a Decoder-only model is more streamlined and efficient for real-time generation.
Step 2
Justify the preference using Scaling Laws.
Solution:
Decoder-only models have demonstrated highly predictable performance improvements (Scaling Laws) when increasing parameters and training data. This massive scale unlocks "emergent abilities," allowing a single Decoder-only model to perform zero-shot summarization highly effectively without needing the task-specific fine-tuning often required by smaller Encoder-Decoder setups.